Value activation for bias alleviation: Generalized-activated deep double deterministic policy gradients

Authors

Abstract

It is vital to accurately estimate the value function in Deep Reinforcement Learning (DRL) so that the agent can execute proper actions instead of suboptimal ones. However, existing actor-critic methods suffer more or less from underestimation bias or overestimation bias, which negatively affects their performance. In this paper, we reveal a simple but effective principle: proper value correction benefits bias alleviation, and we propose a generalized-activated weighting operator that uses any non-decreasing function, namely an activation function, as weights for better value estimation. In particular, we integrate the generalized-activated weighting operator into value estimation and introduce a novel algorithm, Generalized-activated Deep Double Deterministic Policy Gradients (GD3). We theoretically show that GD3 is capable of alleviating potential estimation bias. We interestingly find that simple activation functions lead to satisfying performance with no additional tricks and contribute to faster convergence. Experimental results on numerous challenging continuous control tasks show that GD3 with task-specific activation functions outperforms common baseline methods. We also uncover the fact that fine-tuning the polynomial activation function achieves superior performance on most tasks. Code will be available upon publication.
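The abstract leaves the operator implicit, but its description (any non-decreasing activation used as weights for value estimation) suggests a weighted average over candidate action values. Below is a minimal sketch, assuming the operator takes the form Σ_a g(Q(s,a))·Q(s,a) / Σ_a g(Q(s,a)) over a set of sampled candidate actions; the function names and parameter choices are illustrative, not taken from the paper.

```python
# Minimal sketch of a generalized-activated weighting operator, assuming the
# form sum_a g(Q(s,a)) * Q(s,a) / sum_a g(Q(s,a)); details are illustrative.
import numpy as np

def generalized_activated_value(q_values: np.ndarray, g) -> float:
    """Weighted value estimate with a non-decreasing activation g as weights."""
    q = np.asarray(q_values, dtype=np.float64)
    w = g(q)                       # non-decreasing => higher Q gets more weight
    return float((w * q).sum() / w.sum())

def softmax_activation(x: np.ndarray, beta: float = 5.0) -> np.ndarray:
    return np.exp(beta * (x - x.max()))   # shift cancels in the ratio (stable)

def polynomial_activation(x: np.ndarray, p: int = 2) -> np.ndarray:
    return (x - x.min() + 1e-6) ** p      # nonnegative, non-decreasing weights

q_candidates = np.array([0.5, 1.0, 1.5, 2.0])   # Q(s, a_i) for sampled actions
print(generalized_activated_value(q_candidates, softmax_activation))   # near max
print(generalized_activated_value(q_candidates, polynomial_activation))
```

With g ≡ 1 the operator reduces to the plain average (pessimistic relative to the max), while a sharply increasing g approaches the max operator (optimistic), so the choice of activation trades off under- and overestimation, consistent with the bias-alleviation claim above.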


Similar articles

Distributed Distributional Deterministic Policy Gradients

This work adopts the very successful distributional perspective on reinforcement learning and adapts it to the continuous control setting. We combine this within a distributed framework for off-policy learning in order to develop what we call the Distributed Distributional Deep Deterministic Policy Gradient algorithm, D4PG. We also combine this technique with a number of additional, simple impr...

Full text
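For readers unfamiliar with the distributional perspective this entry mentions: D4PG's critic predicts a categorical distribution over returns on a fixed atom support, and the Bellman target must be projected back onto that support before computing a cross-entropy loss. The sketch below shows only this projection step, with atom bounds, constants, and names chosen for illustration rather than taken from the paper.

```python
# Sketch of projecting a Bellman-updated categorical return distribution back
# onto a fixed atom support (the C51-style step a distributional critic uses).
# All constants and names here are illustrative.
import numpy as np

def project_categorical(probs, rewards, dones, gamma=0.99,
                        v_min=-10.0, v_max=10.0):
    """probs: (batch, n_atoms) next-state return distribution from the critic."""
    n_atoms = probs.shape[1]
    atoms = np.linspace(v_min, v_max, n_atoms)
    delta = (v_max - v_min) / (n_atoms - 1)
    proj = np.zeros_like(probs)
    batch = np.arange(probs.shape[0])
    for j in range(n_atoms):
        # Bellman-shift atom j, then clip to the representable value range.
        tz = np.clip(rewards + gamma * (1.0 - dones) * atoms[j], v_min, v_max)
        b = (tz - v_min) / delta              # fractional index on the support
        lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
        # Split atom j's probability mass between the two nearest atoms.
        proj[batch, lo] += probs[:, j] * (hi - b + (lo == hi))
        proj[batch, hi] += probs[:, j] * (b - lo)
    return proj  # used as the cross-entropy target for the online critic
```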

Revisiting stochastic off-policy action-value gradients

Off-policy stochastic actor-critic methods rely on approximating the stochastic policy gradient in order to derive an optimal policy. One may also derive the optimal policy by approximating the action-value gradient. The use of action-value gradients is desirable as policy improvement occurs along the direction of steepest ascent. This has been studied extensively within the context of natural ...

Full text
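The entry above concerns the stochastic case, but the core idea it describes, improving the policy along the action-value gradient ∇_a Q rather than a likelihood-ratio policy gradient, is easiest to see in its deterministic special case (DPG). A toy PyTorch sketch, with network sizes and names chosen purely for illustration:

```python
# Toy sketch of policy improvement along the action-value gradient, shown in
# the deterministic special case (DPG); all names and sizes are illustrative.
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, 3)                     # a batch of states
actions = actor(states)                         # a = mu_theta(s)
q = critic(torch.cat([states, actions], dim=1)) # Q(s, mu_theta(s))
loss = -q.mean()                                # ascend Q <=> descend -Q
opt.zero_grad()
loss.backward()   # autograd chains dQ/da through dmu/dtheta automatically
opt.step()
```

Backpropagating through the critic's action input is what makes the update follow the direction of steepest ascent in Q that the excerpt refers to.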

Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies

This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the val...

Full text

Deep Deterministic Policy Gradient for Urban Traffic Light Control

Traffic light timing optimization is still an active line of research despite the wealth of scientific literature on the topic, and the problem remains unsolved for any non-toy scenario. One of the key issues with traffic light optimization is the large scale of the input information that is available for the controlling agent, namely all the traffic data that is continually sampled by the traf...

Full text

Policy Gradients for Cryptanalysis

So-called Physical Unclonable Functions are an emerging, new cryptographic and security primitive. They can potentially replace secret binary keys in vulnerable hardware systems and have other security advantages. In this paper, we deal with the cryptanalysis of this new primitive by use of machine learning methods. In particular, we investigate to what extent the security of circuit-based PUFs...

Full text


Journal

Journal title: Neurocomputing

Year: 2023

ISSN: 0925-2312, 1872-8286

DOI: https://doi.org/10.1016/j.neucom.2022.10.085